Imagine you've been commissioned by the City of San Francisco to tackle a problem they've been having with local flora. The parks department has taken extensive documentation of the city's trees since the 1970s - what species are growing, where they are, who they're maintained by - amassing a dataset of over 200K trees in that time.
The funding for that project has recently been called into question, and the City Board needs to see its value before reapproving funds for the following year. Stakeholders have raised several concerns over the past few years, and your job is to use the data to answer them. Good luck!
First things first, let's get some terminology straight.
This is an .ipynb file, also known as a Jupyter notebook. Jupyter notebooks have a few special properties that make them ideal for working with data:
x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'
print(x) # Run this cell after running the one above, and again after running the one below
x = 42
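The three cells above share one running state, so what print(x) shows depends on which assignment ran most recently, not on where the cell sits on the page. A minimal sketch of that behavior in plain Python, where the `cells` dictionary and the `run` helper are hypothetical stand-ins for notebook cells rather than anything Jupyter provides:

```python
# Hypothetical stand-ins for three notebook cells that share one namespace
cells = {
    "above": "x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'",
    "middle": "shown.append(x)",  # plays the role of print(x)
    "below": "x = 42",
}

namespace = {"shown": []}  # the notebook's shared state


def run(name):
    """Execute one 'cell' against the shared namespace."""
    exec(cells[name], namespace)


run("above")
run("middle")  # the middle cell sees the string
run("below")
run("middle")  # the same cell now sees 42

print(namespace["shown"])
```

Running the middle cell twice records two different values, which is why it helps to keep track of the order in which cells were executed.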
Anything you can do in Python, you can do here!
def UltimateQuestion(computer_name):
    return computer_name + ' is thinking...'
UltimateQuestion('Deep Thought')
We use the pandas package to easily work with data as tables.
The numpy package allows us to work with some other special data types, like missing values.
We'll rename these as pd and np, just so it's easier to refer to them later on.
import pandas as pd
import numpy as np
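As a small illustration of the missing values mentioned above, the sketch below builds a toy column (the numbers are made up, not taken from the tree data) and shows that pandas skips np.nan when averaging:

```python
import numpy as np
import pandas as pd

# Made-up diameters; np.nan marks a measurement that was never recorded
diameters = pd.Series([10.0, np.nan, 30.0])

n_missing = diameters.isna().sum()  # count the missing entries
average = diameters.mean()          # NaN is skipped by default

print(n_missing, average)
```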
For this semester, we'll typically work with data in tabular format, the type you'd be used to in an Excel spreadsheet. Data files saved in this format will usually have a .csv file ending, short for comma-separated values.
For example, a CSV file could look something like...
tree_number, species_name, address
312, Magnolia grandiflora, 2828 Divisadero St
124, Melaleuca quinquenervia, 485 Union St
912, Pittosporum undulatum, 47 Vicksburg St
To import this, let's use the pd.read_csv() function:
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-1/workshop/trees.csv'
trees = pd.read_csv(url)
Here, we've saved the data to a DataFrame object named trees.
type(trees)
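pd.read_csv() also accepts file-like objects, so the three-row sample shown earlier can be parsed without a network connection. A minimal sketch, assuming the sample text is pasted into a string (the names `sample_text` and `sample` are only illustrative); `skipinitialspace=True` drops the space that follows each comma:

```python
import io

import pandas as pd

# The example CSV from above, held in a plain string
sample_text = """tree_number, species_name, address
312, Magnolia grandiflora, 2828 Divisadero St
124, Melaleuca quinquenervia, 485 Union St
912, Pittosporum undulatum, 47 Vicksburg St
"""

# StringIO lets read_csv treat the string like an open file
sample = pd.read_csv(io.StringIO(sample_text), skipinitialspace=True)

print(type(sample).__name__)
print(list(sample.columns))
```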
DataFrames contain our data in little "spreadsheet"-like structures. Whatever manipulation you can think of doing to the data, a quick search will likely show you how to do it.
Let's take a look at the data. We'll use the .head() method to display the first 5 rows.
trees.head()
How big is the dataset? .shape returns a tuple with the dimensions as (rows, columns)
trees.shape
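A toy illustration of .head() and .shape, using a hand-made table (the rows are invented for the example, not taken from trees):

```python
import pandas as pd

# Invented rows, just to have something small to inspect
toy = pd.DataFrame({
    "common_name": ["Cherry Plum", "Evergreen Pear", "Cherry Plum", "Red Maple"],
    "dbh": [4, 12, 6, 9],
})

first_two = toy.head(2)      # head() takes the number of rows to show
n_rows, n_cols = toy.shape   # shape is an attribute, so no parentheses

print(first_two)
print(n_rows, n_cols)
```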
Let's try to understand our data a bit better.
trees.species_name.nunique()
trees.common_name.value_counts()
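To see what these two calls return, here is a sketch on a made-up column with one missing name. By default both methods leave missing values out of their results:

```python
import pandas as pd

# Invented names; None stands for a tree whose name was not recorded
toy = pd.DataFrame({"common_name": ["Cherry Plum", "Evergreen Pear", "Cherry Plum", None]})

n_distinct = toy.common_name.nunique()   # number of distinct non-missing names
counts = toy.common_name.value_counts()  # rows per name, most common first

print(n_distinct)
print(counts)
```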
Show the biggest trees by sorting the dataframe:
Note: dbh records the tree's diameter at breast height
trees.sort_values(by='dbh', ascending=False)
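One detail worth knowing: sort_values returns a new, sorted DataFrame and leaves the original untouched. A small sketch on invented rows:

```python
import pandas as pd

# Invented addresses and diameters
toy = pd.DataFrame({"address": ["A St", "B St", "C St"], "dbh": [4, 30, 12]})

largest_first = toy.sort_values(by="dbh", ascending=False)

print(largest_first)  # rows keep their original index labels
print(toy)            # the original order is unchanged
```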
Subsetting is a super helpful tool. We'll look at this in more depth next week, but for now, here are the basics:
We can filter rows from a dataframe based on some condition
# Cherry Plum trees
trees[trees.common_name == 'Cherry Plum']
How would you show only trees north of Golden Gate Park (latitude > 37.77285)?
Hint: Write the condition the same way you would in a Python if statement, mirroring the syntax above
trees[trees.latitude > 37.77285]
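Under the hood, the condition inside the brackets produces a column of True/False values, and only the True rows are kept. A minimal sketch on invented rows; combining two conditions with & and parentheses is an extra shown here for illustration:

```python
import pandas as pd

# Invented rows; the latitudes are not real tree locations
toy = pd.DataFrame({
    "common_name": ["Cherry Plum", "Evergreen Pear", "Cherry Plum"],
    "latitude": [37.78, 37.76, 37.75],
})

mask = toy.latitude > 37.77285  # one True/False per row
north = toy[mask]               # keep only the True rows

# Each condition needs its own parentheses when combined with &
north_cherry = toy[(toy.common_name == "Cherry Plum") & (toy.latitude > 37.77285)]

print(mask.tolist())
print(north)
```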
What is the average diameter of the Evergreen Pear tree?
trees[trees.common_name == 'Evergreen Pear'].dbh.mean()
# Select dbh before averaging so that text columns are not aggregated
trees.groupby(by='common_name')['dbh'].mean().sort_values(ascending=False).head(20)
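The two calculations above are related: filtering to one name and averaging gives the same number as that name's entry in the grouped result. A sketch on invented data:

```python
import pandas as pd

# Invented diameters for two common names
toy = pd.DataFrame({
    "common_name": ["Cherry Plum", "Evergreen Pear", "Cherry Plum", "Evergreen Pear"],
    "dbh": [4, 10, 6, 14],
})

# Average for a single name, via filtering
pear_mean = toy[toy.common_name == "Evergreen Pear"].dbh.mean()

# Average for every name at once, largest first
mean_dbh = toy.groupby("common_name")["dbh"].mean().sort_values(ascending=False)

print(pear_mean)
print(mean_dbh)
```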
First things first, let's import the package to help us visualize the data, plotly.
If this package isn't installed yet, we can install it using !pip install plotly. More on this in week 5.
import plotly.express as px
Note that we're using a subpackage of the broader plotly package, called plotly express. This simplifies a lot of the more difficult steps.
Plotly express has a broad range of options to play with, so let's take a look at the documentation.
Do a quick Google search to pull up the documentation for px.scatter, or run px.scatter? in a Jupyter cell
px.scatter?
trees_sample = trees.sample(frac=.2)
fig = px.scatter(trees_sample, x='date', y='dbh')
fig.show('notebook')
There aren't any obvious trends in this view. Let's add some more parameters:
fig = px.scatter(trees_sample, x='date', y='dbh',
                 opacity=.15, color='site_location',
                 hover_name='common_name', hover_data=['site_location','site_type','address'],
                 marginal_x='histogram', marginal_y='histogram',
                 color_discrete_sequence=px.colors.qualitative.Prism[4:],
                 labels={'site_location':'Site Location', 'dbh':'Tree Diameter', 'date':'Date Recorded'}
                 )
fig.show('notebook')
The transportation department wants to track any trees sitting on a road median, in order to quickly remove debris after a bad storm.
fig = px.scatter_mapbox(trees_sample, lat='latitude', lon='longitude', mapbox_style="stamen-terrain", zoom=11,
                        color='site_location', size='dbh', opacity=.3,
                        color_discrete_sequence=['orange','red','orange','orange','orange','orange'],
                        hover_name='address', hover_data=['site_location','caretaker'],
                        labels={'site_location':'Site Location', 'dbh':'Tree Diameter',
                                'date':'Date Recorded', 'caretaker':'Care Taker'}
                        )
fig.show('notebook')